1.
BMC Bioinformatics ; 23(1): 544, 2022 Dec 16.
Article in English | MEDLINE | ID: mdl-36526957

ABSTRACT

BACKGROUND: The Basic Local Alignment Search Tool (BLAST) is a suite of commonly used algorithms for identifying matches between biological sequences. The user supplies a database file and query file of sequences for BLAST to find identical sequences between the two. The typical millions of database and query sequences make BLAST computationally challenging but also well suited for parallelization on high-performance computing clusters. The efficacy of parallelization depends on the data partitioning, where the optimal data partitioning relies on an accurate performance model. In previous studies, a BLAST job was sped up by 27 times by partitioning the database and query among thousands of processor nodes. However, the optimality of the partitioning method was not studied. Unlike BLAST performance models proposed in the literature that usually have problem size and hardware configuration as the only variables, the execution time of a BLAST job is a function of database size, query size, and hardware capability. In this work, the nucleotide BLAST application BLASTN was profiled using three methods: shell-level profiling with the Unix "time" command, code-level profiling with the built-in "profiler" module, and system-level profiling with the Unix "gprof" program. The runtimes were measured for six node types, using six different database files and 15 query files, on a heterogeneous HPC cluster with 500+ nodes. The empirical measurement data were fitted with quadratic functions to develop performance models that were used to guide the data parallelization for BLASTN jobs. RESULTS: Profiling results showed that BLASTN contains more than 34,500 different functions, but a single function, RunMTBySplitDB, takes 99.12% of the total runtime. Among its 53 child functions, five core functions were identified to make up 92.12% of the overall BLASTN runtime. 
Based on the performance models, static load balancing algorithms can be applied to the BLASTN input data to minimize the runtime of the longest job on an HPC cluster. Four test cases were run on homogeneous and heterogeneous clusters. Experimental results showed that the runtime can be reduced by 81% on a homogeneous cluster and by 20% on a heterogeneous cluster by redistributing the workload. DISCUSSION: Optimal data partitioning can improve BLASTN's overall runtime 5.4-fold in comparison with dividing the database and query into the same number of fragments. The proposed methodology can be used in the other applications in the BLAST+ suite or any other application as long as source code is available.
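
The static load balancing idea in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a single workload dimension (the paper models database size and query size separately), a hypothetical quadratic runtime model t(x) = a*x^2 + b*x + c per node with made-up coefficients, and it uses bisection on the target finish time to minimize the runtime of the slowest node.

```python
import math

def max_load(coeffs, t_limit):
    """Largest workload x >= 0 with a*x^2 + b*x + c <= t_limit.

    Assumes a >= 0, b >= 0 and not both zero (runtime grows with workload).
    """
    a, b, c = coeffs
    if t_limit <= c:
        return 0.0
    if a == 0:
        return (t_limit - c) / b
    return (-b + math.sqrt(b * b + 4 * a * (t_limit - c))) / (2 * a)

def balance(models, total, iters=100):
    """Split `total` work across nodes so the longest runtime is minimal.

    `models` holds one hypothetical (a, b, c) tuple per node.
    Returns (loads, finish_time).
    """
    lo = 0.0
    # Upper bound: the slowest node doing all of the work alone.
    hi = max(a * total ** 2 + b * total + c for a, b, c in models)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if sum(max_load(m, mid) for m in models) >= total:
            hi = mid  # feasible: every node can finish by `mid`
        else:
            lo = mid
    loads = [max_load(m, hi) for m in models]
    scale = total / sum(loads)  # remove the tiny bisection overshoot
    return [x * scale for x in loads], hi

# Two node types; the second takes twice as long per unit of work.
loads, finish = balance([(0.0, 1.0, 0.0), (0.0, 2.0, 0.0)], total=30.0)
```

With these illustrative coefficients the faster node receives twice the work of the slower one, so both finish at the same time.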


Subjects
Computing Methodologies; Software; Algorithms; Computational Biology/methods; Sequence Alignment
2.
Front Artif Intell ; 4: 711467, 2021.
Article in English | MEDLINE | ID: mdl-34409286

ABSTRACT

Drug labeling contains an 'INDICATIONS AND USAGE' section that provides vital information to support clinical decision making and regulatory management. Effective extraction of drug indication information from free-text based resources could facilitate drug repositioning projects and help collect real-world evidence in support of secondary use of approved medicines. To enable AI-powered language models for the extraction of drug indication information, we used manual reading and curation to develop a Drug Indication Classification and Encyclopedia (DICE) based on FDA approved human prescription drug labeling. A DICE scheme with 7,231 sentences categorized into five classes (indications, contradictions, side effects, usage instructions, and clinical observations) was developed. To further elucidate the utility of the DICE, we developed nine different AI-based classifiers for the prediction of indications based on the developed DICE to comprehensively assess their performance. We found that the transformer-based language models yielded an average MCC of 0.887, outperforming the word embedding-based Bidirectional long short-term memory (BiLSTM) models (0.862) with a 2.82% improvement on the test set. The best classifiers were also used to extract drug indication information in DrugBank and achieved a high enrichment rate (>0.930) for this task. We found that domain-specific training could provide more explainable models without performance sacrifices and better generalization for external validation datasets. Altogether, the proposed DICE could be a standard resource for the development and evaluation of task-specific AI-powered, natural language processing (NLP) models.
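
For readers unfamiliar with the metric reported above, the Matthews correlation coefficient (MCC) for a binary task can be computed from confusion-matrix counts as in the sketch below. The binary framing (indication vs. not indication) and the example counts are assumptions for illustration; they are not taken from the paper.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion counts.

    Returns 0.0 when any marginal total is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical counts for an "indication" vs. "other" classifier.
score = mcc(tp=90, tn=85, fp=15, fn=10)
```

MCC ranges from -1 to 1 and, unlike accuracy, stays informative when the classes are imbalanced.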

3.
Front Pharmacol ; 12: 608778, 2021.
Article in English | MEDLINE | ID: mdl-33967751

ABSTRACT

High-risk neuroblastoma (NB) remains a significant therapeutic challenge facing current pediatric oncology patients. Structural variants such as gene fusions have shown an initial promise in enhancing mechanistic understanding of NB and improving survival rates. In this study, we performed a comprehensive in silico investigation on the translational ability of gene fusions for patient stratification and treatment development for high-risk NB patients. Specifically, three state-of-the-art gene fusion detection algorithms, including ChimeraScan, SOAPfuse, and TopHat-Fusion, were employed to identify the fusion transcripts in an RNA-seq data set of 498 neuroblastoma patients. Then, the 176 high-risk patients were further stratified into four different subgroups based on gene fusion profiles. Furthermore, Kaplan-Meier survival analysis was performed, and differentially expressed genes (DEGs) for the redefined high-risk group were extracted and functionally analyzed. Finally, repositioning candidates were enriched in each patient subgroup with drug transcriptomic profiles from the LINCS L1000 Connectivity Map. We found that the number of identified gene fusions increased from the clinical low-risk stage to the high-risk stage. Although the technical concordance of fusion detection algorithms was suboptimal, they have a similar biological relevance concerning perturbed pathways and regulated DEGs. The gene fusion profiles could be utilized to redefine high-risk patient subgroups with significant onset age of NB, which yielded improved survival curves (Log-rank p value ≤ 0.05). Out of 48 enriched repositioning candidates, 45 (93.8%) have antitumor potency, and 24 (50%) were confirmed with either on-going clinical trials or literature reports. The gene fusion profiles have a discrimination power for redefining patient subgroups in high-risk NB and facilitate precision medicine-based drug repositioning implementation.
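
The Kaplan-Meier survival analysis mentioned in this abstract rests on the product-limit estimator, sketched below in plain Python. The follow-up times and event flags are invented for illustration and have no connection to the study's patient data.

```python
def kaplan_meier(durations, events):
    """Product-limit estimate of the survival function.

    durations: follow-up time per patient.
    events: 1 if the event (e.g. death) was observed, 0 if censored.
    Returns a list of (time, survival_probability) at each event time.
    """
    data = sorted(zip(durations, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed  # events and censored patients both leave the risk set
    return curve

# Illustrative data: the patient at t=2 is censored.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

Comparing such curves between subgroups, for example with a log-rank test, is what supports a statement like the p value quoted above.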

5.
Front Pharmacol ; 11: 927, 2020.
Article in English | MEDLINE | ID: mdl-32676024

ABSTRACT

Noonan and LEOPARD syndromes (NS and LS) belong to a group of related disorders called RASopathies characterized by abnormalities of multiple organs and systems including hypertrophic cardiomyopathy and dysmorphic facial features. There are no approved drugs for these two rare diseases, but it is known that a missense mutation in PTPN11 genes is associated with approximately 50% and 70% of NS and LS cases, respectively. In this study, we implemented a hybrid computational drug repositioning framework by integrating transcriptomic and structure-based approaches to explore potential treatment options for NS and LS. Specifically, disease signatures were derived from the transcriptomic profiles of human induced pluripotent stem cells (iPSCs) from NS and LS patients and reverse correlated to drug transcriptomic signatures from CMap and L1000 projects on the basis that if disease and drug transcriptomic signatures are reversely correlated, the drug has the potential to treat that disease. The compounds that were ranked top based on their transcriptomic profiles were docked to mutated and wild-type 3D structures of PTPN11 by an adjusted Induced Fit Docking (IFD) protocol. In addition, we prioritized repositioned candidates for NS and LS by a consensus ranking strategy. Network analysis and phenotypic anchoring of the transcriptomic data could discriminate the two diseases at the molecular level. Furthermore, the adjusted IFD protocol was able to recapitulate the binding specificity of potential drug candidates to mutated 3D structures, revealing the relevant amino acids. Importantly, a list of potential drug candidates for repositioning was identified including 61 for NS and 43 for LS and was further verified from literature reports and on-going clinical trials. Altogether, this hybrid computational drug repositioning approach has highlighted a number of drug candidates for NS and LS and could be applied to identifying drug candidates for other diseases as well.
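
The reverse-correlation principle described above (a drug whose transcriptomic signature opposes the disease signature is a repositioning candidate) can be sketched as follows. Pearson correlation is used here only as a simple stand-in; the scoring actually used with CMap and L1000 data may differ, and the gene values and drug names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def rank_by_reversal(disease_signature, drug_signatures):
    """Order drugs from most negatively to most positively correlated."""
    scores = {
        name: pearson(disease_signature, signature)
        for name, signature in drug_signatures.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])

# Hypothetical log fold changes over the same four genes.
disease = [2.0, -1.0, 0.5, -2.0]
drugs = {
    "drug_a": [-2.0, 1.0, -0.5, 2.0],  # reverses the disease signature
    "drug_b": [2.0, -1.0, 0.5, -2.0],  # mimics the disease signature
}
ranking = rank_by_reversal(disease, drugs)
```

In this toy example `drug_a` ranks first because its signature is the mirror image of the disease signature.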

6.
Drug Discov Today ; 24(1): 9-15, 2019 01.
Article in English | MEDLINE | ID: mdl-29902520

ABSTRACT

Drug-induced rhabdomyolysis (DIR) is an idiosyncratic and fatal adverse drug reaction (ADR) characterized by severe muscle injuries accompanied by multiple-organ failure. Limited knowledge regarding the pathophysiology of rhabdomyolysis is the main obstacle to developing early biomarkers and prevention strategies. Given the lack of a centralized data resource to curate, organize, and standardize widespread DIR information, here we present a Drug-Induced Rhabdomyolysis Atlas (DIRA) that provides DIR-related information, including: a classification scheme for DIR based on drug labeling information; postmarketing surveillance data of DIR; and DIR drug property information. To elucidate the utility of DIRA, we used precision dosing, concomitant use of DIR drugs, and predictive modeling development to exemplify strategies for idiosyncratic ADR (IADR) management.


Subjects
Rhabdomyolysis/chemically induced; Rhabdomyolysis/classification; Animals; Drug Interactions; Drug Labeling; Humans; Internet; Product Surveillance, Postmarketing; Rhabdomyolysis/prevention & control
7.
mSphere ; 3(2)2018.
Article in English | MEDLINE | ID: mdl-29564396

ABSTRACT

Detection of distantly related viruses by high-throughput sequencing (HTS) is bioinformatically challenging because of the lack of a public database containing all viral sequences, without abundant nonviral sequences, which can extend runtime and obscure viral hits. Our reference viral database (RVDB) includes all viral, virus-related, and virus-like nucleotide sequences (excluding bacterial viruses), regardless of length, and with overall reduced cellular sequences. Semantic selection criteria (SEM-I) were used to select viral sequences from GenBank, resulting in a first-generation viral database (VDB). This database was manually and computationally reviewed, resulting in refined, semantic selection criteria (SEM-R), which were applied to a new download of updated GenBank sequences to create a second-generation VDB. Viral entries in the latter were clustered at 98% by CD-HIT-EST to reduce redundancy while retaining high viral sequence diversity. The viral identity of the clustered representative sequences (creps) was confirmed by BLAST searches in NCBI databases and HMMER searches in PFAM and DFAM databases. The resulting RVDB contained a broad representation of viral families, sequence diversity, and a reduced cellular content; it includes full-length and partial sequences and endogenous nonretroviral elements, endogenous retroviruses, and retrotransposons. Testing of RVDBv10.2 with an in-house HTS transcriptomic data set indicated a significantly faster run for virus detection than interrogating the entirety of the NCBI nonredundant nucleotide database, which contains all viral sequences but also nonviral sequences. RVDB is publicly available for facilitating HTS analysis, particularly for novel virus detection. It is meant to be updated on a regular basis to include new viral sequences added to GenBank.
IMPORTANCE To facilitate bioinformatics analysis of high-throughput sequencing (HTS) data for the detection of both known and novel viruses, we have developed a new reference viral database (RVDB) that provides a broad representation of different virus species from eukaryotes by including all viral, virus-like, and virus-related sequences (excluding bacteriophages), regardless of their size. In particular, RVDB contains endogenous nonretroviral elements, endogenous retroviruses, and retrotransposons. Sequences were clustered to reduce redundancy while retaining high viral sequence diversity. A particularly useful feature of RVDB is the reduction of cellular sequences, which can enhance the run efficiency of large transcriptomic and genomic data analysis and increase the specificity of virus detection.

8.
BMC Bioinformatics ; 18(Suppl 14): 501, 2017 12 28.
Article in English | MEDLINE | ID: mdl-29297287

ABSTRACT

BACKGROUND: Recent breakthroughs in molecular biology and next generation sequencing technologies have led to the exponential growth of the sequence databases. Researchers use BLAST for processing these sequences. However, traditional software parallelization techniques (threads, message passing interface) applied in newer versions of BLAST are not adequate for processing these sequences in a timely manner. METHODS: A new method for array job parallelization has been developed which offers O(T) theoretical speed-up in comparison to multi-threading and MPI techniques. Here T is the number of array job tasks. (The number of CPUs that will be used to complete the job equals the product of T multiplied by the number of CPUs used by a single task.) The approach is based on segmentation of both input datasets to the BLAST process, combining partial solutions published earlier (Dhanker and Gupta, Int J Comput Sci Inf Technol 5:4818-4820, 2014), (Grant et al., Bioinformatics 18:765-766, 2002), (Mathog, Bioinformatics 19:1865-1866, 2003). It is accordingly referred to as a "dual segmentation" method. In order to implement the new method, the BLAST source code was modified to allow the researcher to pass to the program the number of records (effective number of sequences) in the original database. The team also developed methods to manage and consolidate the large number of partial results that get produced. Dual segmentation allows for massive parallelization, which lifts the scaling ceiling in exciting ways. RESULTS: BLAST jobs that hitherto failed or slogged inefficiently to completion now finish with speeds that characteristically reduce wallclock time from 27 days on 40 CPUs to a single day using 4104 tasks, each task utilizing eight CPUs and taking less than 7 minutes to complete. CONCLUSIONS: The massive increase in the number of tasks when running an analysis job with dual segmentation reduces the size, scope and execution time of each task.
Besides significant speed of completion, additional benefits include fine-grained checkpointing and increased flexibility of job submission. "Trickling in" a swarm of individual small tasks tempers competition for CPU time in the shared HPC environment, and jobs submitted during quiet periods can complete in extraordinarily short time frames. The smaller task size also allows the use of older and less powerful hardware. The CDRH workhorse cluster was commissioned in 2010, yet its eight-core CPUs with only 24GB RAM work well in 2017 for these dual segmentation jobs. Finally, these techniques are excitingly friendly to budget conscious scientific research organizations where probabilistic algorithms such as BLAST might discourage attempts at greater certainty because single runs represent a major resource drain. If a job that used to take 24 days can now be completed in less than an hour or on a space available basis (which is the case at CDRH), repeated runs for more exhaustive analyses can be usefully contemplated.
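
The "dual segmentation" idea, in which both the database and the query set are split and each array job task handles one (database fragment, query fragment) pair, can be sketched as an index mapping. The fragment counts, the function names, and the row-major task numbering below are assumptions for illustration; the abstract does not state how the 4104 tasks were divided between database and query fragments.

```python
def task_to_fragments(task_id, n_query_fragments):
    """Map a zero-based array task index to (db_fragment, query_fragment)."""
    return divmod(task_id, n_query_fragments)

def total_cpus(n_tasks, cpus_per_task):
    """CPU count if all tasks ran at once: T times the CPUs of a single task."""
    return n_tasks * cpus_per_task

# Hypothetical split: 4 database fragments x 3 query fragments = 12 tasks.
n_db, n_query = 4, 3
pairs = [task_to_fragments(t, n_query) for t in range(n_db * n_query)]
```

Each task would then run BLAST on its own pair of fragments, and the partial results would be consolidated afterwards, as the abstract describes.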


Subjects
Algorithms; Computational Biology/methods; Databases, Nucleic Acid; Humans; Search Engine; Software
9.
Nucleic Acids Res ; 39(Web Server issue): W492-8, 2011 Jul.
Article in English | MEDLINE | ID: mdl-21558322

ABSTRACT

Identifying new indications for existing drugs (drug repositioning) is an efficient way of maximizing their potential. Adverse drug reaction (ADR) is one of the leading causes of death among hospitalized patients. As both new indications and ADRs are caused by unexpected chemical-protein interactions on off-targets, it is reasonable to predict these interactions by mining the chemical-protein interactome (CPI). Making such predictions has recently been facilitated by a web server named DRAR-CPI. This server has a representative collection of drug molecules and targetable human proteins built up from our work in drug repositioning and ADR. When a user submits a molecule, the server will give the positive or negative association scores between the user's molecule and our library drugs based on their interaction profiles towards the targets. Users can thus predict the indications or ADRs of their molecule based on the association scores towards our library drugs. We have matched our predictions of drug-drug associations with those predicted via gene-expression profiles, achieving a matching rate as high as 74%. We have also successfully predicted the connections between anti-psychotics and anti-infectives, indicating the underlying relevance of anti-psychotics in the potential treatment of infections, and vice versa. This server is freely available at http://cpi.bio-x.cn/drar/.


Subjects
Drug Repositioning; Drug-Related Side Effects and Adverse Reactions/prevention & control; Software; Anti-Infective Agents/adverse effects; Anti-Infective Agents/therapeutic use; Antipsychotic Agents/adverse effects; Antipsychotic Agents/therapeutic use; Humans; Hypoglycemic Agents/adverse effects; Hypoglycemic Agents/therapeutic use; Internet; Ligands; Pharmaceutical Preparations/chemistry; Proteins/chemistry; Rosiglitazone; Thiazolidinediones/adverse effects; Thiazolidinediones/therapeutic use